Conversation

@UnravelSports (Contributor) commented Dec 17, 2025

This is a continuation of stephTchembeu#5 and #513. Because of all the misalignments, it was easier to create a new PR.

Overview

This PR adds TrackingDataset.to_cdf() functionality. It incorporates @koenvo's improved writing support from PR #515 and a cleaned-up version of the work done by @stephTchembeu.

In short, we can now export kloppy tracking data to the Common Data Format (Anzer et al. 2025).

from kloppy import skillcorner

dataset = skillcorner.load_open_data(only_alive=False)

dataset.to_cdf(
    metadata_output_file='output/metadata.json',
    tracking_output_file='output/tracking.jsonl'
)

Because kloppy does not process some values that are mandatory in the CDF (stadium ID, competition ID, season ID, tracking version, and collection timing), running the above will throw warnings such as:

UserWarning: Missing mandatory ID at 'competition.id'. Currently replaced with the value 'MISSING_MANDATORY_VALUE'. Please provide the correct value to 'additional_metadata' to completely adhere to the CDF specification.

We can resolve this by passing additional_metadata to to_cdf(), using the Common Data Format Validator TypedDicts (you don't have to use these, but they help keep everything in the correct schema), like so:

from cdf.domain import CdfMetaDataSchema, Stadium, Competition, Season, Meta, Tracking

additional_meta_data = CdfMetaDataSchema(
    competition=Competition(
        id="COMP_123",
        name="Test Competition",
        format="league_20"
    ),
    season=Season(id="SEASON_2024", name="2024/25"),
    stadium=Stadium(
        id="STADIUM_456",
        name="Test Arena",
        turf="grass",
    ),
    meta=Meta(
        tracking=Tracking(
            version="2.0.0",
            name="TestTracker",
            fps=30,
            collection_timing="live"
        )
    )
)

We can then run:

from kloppy import skillcorner

dataset = skillcorner.load_open_data(only_alive=False)

dataset.to_cdf(
    metadata_output_file='output/metadata.json',
    tracking_output_file='output/tracking.jsonl',
    additional_metadata=additional_meta_data
)

This no longer throws any warnings and outputs the correct files.

Note: we set only_alive=False, because not doing so will also show a warning.

Common Data Format Validator

We have new unit tests for the writing functionality, plus tests that validate the output against the CDF schema using the common-data-format-validator. This is a development dependency. Note that if the CDF changes its structure, these tests will start failing on the kloppy side too. I can imagine this is not ideal, but I'm not sure what to do about it. Any suggestions are more than welcome.
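
As illustration, here is a minimal sketch of what such a round-trip schema test can look like; validate_metadata is a hypothetical stand-in for the validator's actual entry point, and the final assertion only checks a key implied by the warning shown earlier:

import json

from kloppy import skillcorner


def test_to_cdf_output_matches_schema(tmp_path):
    # Export the open data to CDF files in a pytest-provided temp directory
    dataset = skillcorner.load_open_data(only_alive=False)
    metadata_file = tmp_path / "metadata.json"
    tracking_file = tmp_path / "tracking.jsonl"
    dataset.to_cdf(
        metadata_output_file=str(metadata_file),
        tracking_output_file=str(tracking_file),
    )

    metadata = json.loads(metadata_file.read_text())
    # validate_metadata(metadata)  # hypothetical validator entry point
    assert "competition" in metadata  # key implied by the warning above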

Next Steps

I would like to continue with reading CDF tracking data, and with writing and reading CDF event data. Should I do this in a new PR, or shall I pile everything into this one?

stephTchembeu and others added 27 commits October 28, 2025 09:42
This adds comprehensive write support to the open_as_file() function with
efficient memory management and streaming capabilities.

Key features:
- BufferedStream: SpooledTemporaryFile wrapper with chunked I/O (5MB memory threshold)
- Write modes: 'wb' (write), 'ab' (append) - binary only
- Adapter pattern: write_from_stream() method (opt-in for adapters)
- Compression support: .gz, .bz2, .xz files handled automatically
- Local files and S3 URIs supported via FSSpecAdapter
- Protocols for type safety: SupportsRead, SupportsWrite

Implementation details:
- read_from()/write_to() methods use shutil.copyfileobj for chunked copying
- Context manager pattern buffers writes and flushes on exit
- No breaking changes to existing read functionality
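
For context, a hedged usage sketch of the write path this commit message describes; the import path and the mode keyword are assumptions based on the feature list above, not confirmed API:

from kloppy.io import open_as_file  # assumed import path

# 'wb' buffers writes in a BufferedStream and flushes them through the
# matching adapter on exit; .gz compression is inferred from the extension.
with open_as_file("output/tracking.jsonl.gz", mode="wb") as f:
    f.write(b'{"frame_id": 1}\n')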
@UnravelSports UnravelSports changed the title from "[test]" to "[CDF] to_cdf" on Dec 17, 2025
@UnravelSports UnravelSports requested review from probberechts and removed request for probberechts December 17, 2025 11:58
@UnravelSports (Contributor, Author) commented:

Added

from kloppy import cdf

dataset = cdf.load_tracking(
    meta_data="output/metadata.json",
    raw_data="output/tracking.jsonl.gz"
)

Full functionality including tests.

The implementation includes forced validation of the incoming data using the common-data-format-validator. This means that if you want to use the cdf.load_tracking functionality, you need to have it installed. If this is not preferred, I can remove it, but it could at least be informative when people first start using it.

I can also make this validation optional, of course.
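
A sketch of what that toggle could look like; the validate parameter is hypothetical and not part of the current API:

from kloppy import cdf

dataset = cdf.load_tracking(
    meta_data="output/metadata.json",
    raw_data="output/tracking.jsonl.gz",
    validate=False,  # hypothetical flag: skip the common-data-format-validator check
)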

    only_alive: Optional[bool] = True,
) -> TrackingDataset:
    """
    Load Common Data Format broadcast tracking data.
Contributor:

Does this only work for broadcast tracking data? I guess this is a copy-paste error.

@@ -0,0 +1,5 @@
"""Functions for loading SkillCorner broadcast tracking data."""
Contributor:

Suggested change:
- """Functions for loading SkillCorner broadcast tracking data."""
+ """Functions for loading data in the Common Data Format (CDF) standard."""
  • I think it would be good to add a sentence explaining the CDF with a reference to the arXiv paper.
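
For example, the module docstring could read as follows; the wording and citation format are placeholders:

"""Functions for loading data in the Common Data Format (CDF) standard.

The CDF is a provider-neutral specification for football tracking and
event data proposed by Anzer et al. (2025).
"""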

Comment on lines +1229 to +1231
base_coordinate_system: ProviderCoordinateSystem | None = None,
pitch_length: float | None = None,
pitch_width: float | None = None,
Contributor:

Should use Optional[T] here instead of T | None to be compatible with Python 3.9
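
That is, the same three parameters spelled with typing.Optional; the surrounding function name is a stand-in, and the string annotation avoids needing the real import:

from typing import Optional


def load_tracking(
    base_coordinate_system: Optional["ProviderCoordinateSystem"] = None,
    pitch_length: Optional[float] = None,
    pitch_width: Optional[float] = None,
):
    ...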


@property
def pitch_dimensions(self) -> PitchDimensions:
    return NormalizedPitchDimensions(
Contributor:

I think these should be MetricPitchDimensions

        raise KloppyParameterError(f"Engine {engine} is not valid")

    def to_cdf(self):
        if self.dataset_type != DatasetType.TRACKING:
Contributor:

Remove the if here and override in the TrackingDataset. A parent class should not be aware of its children.
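
A sketch of the suggested restructuring, with class and method bodies reduced to the dispatch pattern; names besides to_cdf and CDFTrackingSerializer are simplified stand-ins:

class Dataset:
    # The parent class no longer defines to_cdf, so it stays unaware
    # of which subclasses support CDF export.
    ...


class TrackingDataset(Dataset):
    def to_cdf(self, metadata_output_file: str, tracking_output_file: str):
        # Only the tracking subclass exposes CDF export; the
        # DatasetType.TRACKING check becomes unnecessary.
        serializer = CDFTrackingSerializer()
        ...  # serialize self into the two output files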


serializer = CDFTrackingSerializer()

# TODO: write files but also support non-local files, similar to how open_as_file supports non-local files
Contributor:

I think this is done now.

Comment on lines +336 to +360
@contextlib.contextmanager
def _write_context_manager(
    uri: str, mode: str
) -> Generator[BinaryIO, None, None]:
    """
    Context manager for write operations that buffers writes and flushes to adapter on exit.

    Args:
        uri: The destination URI
        mode: Write mode ('wb' or 'ab')

    Yields:
        A BufferedStream for writing
    """
    buffer = BufferedStream()
    try:
        yield buffer
    finally:
        adapter = get_adapter(uri)
        if adapter:
            adapter.write_from_stream(uri, buffer, mode)
        else:
            raise AdapterError(f"No adapter found for {uri}")


Contributor:

This should be removed. Something went wrong in the merge with master. It's now duplicated.

Contributor:

Why do we need a new test file? Can't we simply use the existing SkillCorner test data?

"pytest-httpserver", # Mock HTTP server for testing
"moto[s3,server]", # Mock AWS S3 for testing
"pre-commit>=4.2.0", # Git hooks for code quality checks
"common-data-format-validator==0.0.13",
Contributor:

This should not be added as a dev dependency. Create a new optional dependency group instead.
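
For example, in pyproject.toml; the group name cdf is a placeholder choice:

[project.optional-dependencies]
cdf = [
    "common-data-format-validator==0.0.13",
]

Users would then opt in with pip install kloppy[cdf].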

    pitch_dimensions=transformer.get_to_coordinate_system().pitch_dimensions,
    frame_rate=frame_rate,
    orientation=orientation,
    provider=Provider.CDF,
Contributor:

I'm not sure whether this is desired. I believe the value of this attribute should be the actual data provider, not the data-representation standard used. Doesn't the CDF store the actual data provider?
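
A hypothetical sketch of deriving the provider from the CDF metadata instead of hard-coding it; the metadata path follows the Meta/Tracking schema shown earlier, and the Provider.OTHER fallback is an assumption:

# Hypothetical: look up the kloppy Provider by the tracking provider name
# stored in the CDF metadata (e.g. meta.tracking.name == "TestTracker").
tracking_name = metadata["meta"]["tracking"]["name"]
try:
    provider = Provider[tracking_name.upper()]
except KeyError:
    provider = Provider.OTHER  # assumed fallback for unknown provider names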

@probberechts (Contributor) commented:

Can you also add some documentation on how to load CDF data and how to export a dataset to the CDF format?
